Abstract

Explanation-based generalization is used to extract a 
specialized grammar from the original one using a train- 
ing corpus of parse trees. This allows very much faster 
parsing and gives a lower error rate, at the price of a 
small loss in coverage. Previously, it has been necessary 
to specify the tree-cutting criteria (or operationality cri- 
teria) manually; here they are derived automatically 
from the training set and the desired coverage of the 
specialized grammar. This is done by assigning an en- 
tropy value to each node in the parse trees and cutting 
in the nodes with sufficiently high entropy values. 

BACKGROUND 

Previous work by Manny Rayner and the author, see 
[Samuelsson & Rayner 1991] attempts to tailor an ex- 
isting natural-language system to a specific application 
domain by extracting a specialized grammar from the 
original one using a large set of training examples. The 
training set is a treebank consisting of implicit parse 
trees that each specify a verified analysis of an input 
sentence. The parse trees are implicit in the sense that 
each node in the tree is the (mnemonic) name of the 
grammar rule resolved on at that point, rather than the 
syntactic category of the LHS of the grammar rule as is 
the case in an ordinary parse tree. Figure 1 shows five
examples of implicit parse trees. The analyses are ver- 
ified in the sense that each analysis has been judged to 
be the preferred one for that input sentence by a human 
evaluator using a semi-automatic evaluation method. 

A new grammar is created by cutting up each implicit 
parse tree in the treebank at appropriate points, creat- 
ing a set of new rules that consist of chunks of original 
grammar rules. The LHS of each new rule will be the 
LHS phrase of the original grammar rule at the root of 
the tree chunk and the RHS will be the RHS phrases of 
the rules in the leaves of the tree chunk. For example, 
cutting up the first parse tree of Figure 1 at the NP of
the rule vp_v_np yields rules 2 and 3 of Figure 3.



The idea behind this is to create a specialized grammar
that retains a high coverage but allows very much faster
parsing. This has turned out to be possible: a median
speedup of 60 times compared to the original grammar was
achieved at a cost in coverage of about ten percent, see
[Samuelsson 1994a].¹ Another benefit of the method is a
decreased error rate when the system is required to select
a preferred analysis. In these experiments the scheme was
applied to the grammar of a version of the SRI Core
Language Engine [Alshawi ed. 1992] adapted to the Atis
domain for a speech-translation task [Rayner et al 1993]
and large corpora of real user data collected using
Wizard-of-Oz simulation. The resulting specialized grammar
was compiled into LR parsing tables, and a special LR
parser exploited their special properties, see
[Samuelsson 1994b].



The technical vehicle previously used to extract the 
specialized grammar is explanation-based generalization
(EBG), see e.g. [Mitchell et al 1986]. Very briefly,
this consists of redoing the derivation of each train- 
ing example top-down by letting the implicit parse tree 
drive a rule expansion process, and aborting the expan- 
sion of the specialized rule currently being extracted if 
the current node of the implicit parse tree meets a set 
of tree-cutting criteria.² In this case the extraction process
is invoked recursively to extract subrules rooted in
the current node. The tree-cutting criteria can be local 
("The LHS of the original grammar rule is an NP") or 
dependent on the rest of the parse tree ("that doesn't 
dominate the empty string only," ) and previous choices 
of nodes to cut at ( "and there is no cut above the cur- 
rent node that is also labelled NP."). 

A problem not fully explored yet is how to arrive 
at an optimal choice of tree-cutting criteria. In the 
previous scheme, these must be specified manually, and 



1 Other more easily obtainable publications about this are 
in preparation. 

2 These are usually referred to as "operationality criteria" 
in the EBG literature. 



the choice is left to the designer's intuitions. This article 
addresses the problem of automating this process and 
presents a method where the nodes to cut at are selected 
automatically using the information-theoretical concept 
of entropy. Entropy is well-known from physics, but the 
concept of perplexity is perhaps better known in the 
speech-recognition and natural-language communities.
For this reason, we will review the concept of entropy 
at this point, and discuss its relation to perplexity. 

Entropy 

Entropy is a measure of disorder. Assume for example
that a physical system can be in any of N states,
and that it will be in state $s_i$ with probability $p_i$. The
entropy S of that system is then

$$S = \sum_{i=1}^{N} -p_i \ln p_i$$

If each state has equal probability, i.e. if $p_i = \frac{1}{N}$ for all
i, then

$$S = \sum_{i=1}^{N} -\frac{1}{N} \ln \frac{1}{N} = \ln N$$



In this case the entropy is simply the logarithm of the 
number of states the system can be in. 

To take a linguistic example, assume that we are trying
to predict the next word in a word string from the
previous ones. Let the next word be $w_k$ and the previous
word string $w_1, \ldots, w_{k-1}$. Assume further that
we have a language model that estimates the probability
of each possible next word (conditional on the
previous word string). Let these probabilities be $p_i$
for $i = 1, \ldots, N$ for the N possible next words $w_k^i$,
i.e. $p_i = p(w_k^i \mid w_1, \ldots, w_{k-1})$. The entropy is then a
measure of how hard this prediction problem is:

$$S(w_1, \ldots, w_{k-1}) = \sum_{i=1}^{N} -p(w_k^i \mid w_1, \ldots, w_{k-1}) \ln p(w_k^i \mid w_1, \ldots, w_{k-1})$$

If all words have equal probability, the entropy is the 
logarithm of the branching factor at this point in the 
input string. 

Perplexity 

Perplexity is related to entropy as follows. The observed
perplexity P of a language model with respect to an
(imaginary) infinite test sequence $w_1, w_2, \ldots$ is defined
through the formula (see [Jelinek 1990])

$$\ln P = \lim_{n \to \infty} -\frac{1}{n} \ln p(w_1, \ldots, w_n)$$

Here $p(w_1, \ldots, w_n)$ denotes the probability of the word
string $w_1, \ldots, w_n$.

Since we cannot experimentally measure infinite limits,
we terminate after a finite test string $w_1, \ldots, w_m$,
arriving at the measured perplexity $P_m$:

$$\ln P_m = -\frac{1}{m} \ln p(w_1, \ldots, w_m)$$

Rewriting $p(w_1, \ldots, w_k)$ as $p(w_k \mid w_1, \ldots, w_{k-1}) \cdot p(w_1, \ldots, w_{k-1})$
gives us

$$\ln P_m = \frac{1}{m} \sum_{k=1}^{m} -\ln p(w_k \mid w_1, \ldots, w_{k-1})$$
Let us call the exponential of the expectation value of
$-\ln p(w \mid String)$ the local perplexity $P_l(String)$, which
can be used as a measure of the information content of
the initial String.

$$\ln P_l(w_1, \ldots, w_{k-1}) = E(-\ln p(w_k \mid w_1, \ldots, w_{k-1})) = \sum_{i=1}^{N} -p(w_k^i \mid w_1, \ldots, w_{k-1}) \ln p(w_k^i \mid w_1, \ldots, w_{k-1})$$

Here $E(\eta)$ is the expectation value of $\eta$ and the summation
is carried out over all N possible next words $w_k^i$.
Comparing this with the last equation of the previous 
section, we see that this is precisely the entropy S at 
point k in the input string. Thus, the entropy is the 
logarithm of the local perplexity at a given point in the 
word string. If all words are equally probable, then the 
local perplexity is simply the branching factor at this 
point. If the probabilities differ, the local perplexity 
can be viewed as a generalized branching factor that 
takes this into account. 
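
As a small numerical illustration of the relation between entropy and local perplexity, the following Python sketch computes both for a next-word distribution. The function names and the example probabilities are assumptions made for illustration only; they do not come from the experiments reported here.

```python
import math

def entropy(probs):
    """Entropy S = sum_i -p_i ln p_i (natural logarithm); zero probabilities contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def local_perplexity(probs):
    """Local perplexity: the exponential of the entropy of the next-word distribution."""
    return math.exp(entropy(probs))

# Uniform case: four equally probable next words, so the branching factor is 4.
uniform = [0.25, 0.25, 0.25, 0.25]
# Skewed case (made-up numbers): the same four words, one of them strongly preferred.
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform), local_perplexity(uniform))
print(entropy(skewed), local_perplexity(skewed))
```

In the uniform case the local perplexity coincides with the branching factor; in the skewed case it is smaller, which is the sense in which it acts as a generalized branching factor.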

Tree entropy 

We now turn to the task of calculating the entropy of a 
node in a parse tree. This can be done in many different 
ways; we will only describe two different ones here. 

Consider the small test and training sets of Figure 1.
Assume that we wish to calculate the entropy of the
phrases of the rule PP → Prep NP, which is named
pp_prep_np. In the training set, the LHS PP is attached
to the RHS PP of the rule np_np_pp in two cases and to
the RHS PP of the rule vp_vp_pp in one case, giving it
the entropy $-\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} \approx 0.64$. The
RHS preposition Prep is always a lexical lookup, and
the entropy is thus zero,³ while the RHS NP in one case
3 Since there is only one alternative, namely a lexical 
lookup. In fact, the scheme could easily be extended to en- 
compass including lexical lookups of particular words into 
the specialized rules by distinguishing lexical lookups of dif- 
ferent words; the entropy would then determine whether or 
not to cut in a node corresponding to a lookup, just as for 
any other node, as is described in the following. 



Training examples:

       s_np_vp
       /     \
 np_pron     vp_v_np
    |         /   \
   lex      lex   np_det_n
    |        |     /   \
    I      want  lex   lex
                  |     |
                  a   ticket

       s_np_vp
       /     \
 np_pron     vp_v_np
    |         /   \
   lex      lex   np_np_pp
    |        |     /      \
    I      need  np_det_n  pp_prep_np
                  /  \       /  \
                lex  lex   lex  lex
                 |    |     |    |
                 a  flight  to  Boston

       s_np_vp
       /     \
 np_pron     vp_v_np
    |         /   \
   lex      lex   np_np_pp
    |        |     /      \
   We      have  np_det_n  pp_prep_np
                  /  \       /  \
                lex  lex   lex  np_det_n
                 |    |     |     /  \
                 a departure in lex  lex
                                 |    |
                                the morning

         s_np_vp
         /     \
  np_det_n     vp_vp_pp
    /  \        /    \
  lex  lex   vp_v   pp_prep_np
   |    |      |      /  \
  The flight  lex   lex  np_num
               |     |     |
            departs  at   lex
                           |
                          ten

Test example:

       s_np_vp
       /     \
 np_pron     vp_v_np
    |         /   \
   lex      lex   np_np_pp
    |        |     /      \
   He    booked  np_det_n  pp_prep_np
                  /  \       /  \
                lex  lex   lex  np_np_pp
                 |    |     |    /      \
                 a  ticket for np_det_n  pp_prep_np
                                /  \       /  \
                              lex  lex   lex  lex
                               |    |     |    |
                               a  flight  to  Dallas

Figure 1: A tiny training set



attaches to the LHS of rule np_det_n, in one case to
the LHS of rule np_num, and in one case is a lexical
lookup, and the resulting entropy is thus $-\ln\frac{1}{3} \approx 1.10$.
The complete table is given here: 



Rule          LHS    1st RHS   2nd RHS
s_np_vp       0.00   0.56      0.56
np_np_pp      0.00   0.00      0.00
np_det_n      1.33   0.00      0.00
np_pron       0.00   0.00
np_num        0.00   0.00
vp_vp_pp      0.00   0.00      0.00
vp_v_np       0.00   0.00      0.64
vp_v          0.00   0.00
pp_prep_np    0.64   0.00      1.10



If we want to calculate the entropy of a particular 
node in a parse tree, we can either simply use the phrase 
entropy of the RHS node, or take the sum of the en- 
tropies of the two phrases that are unified in this node. 
For example, the entropy when the RHS NP of the 
rule pp_prep_np is unified with the LHS of the rule 
np_det_n will in the former case be 1.10 and in the 
latter case be 1.10 + 1.33 = 2.43. 
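
The phrase entropies in the table can be recomputed from the training trees of Figure 1. The Python sketch below is one possible way of doing so and is not the implementation used in the experiments: the nested-tuple encoding of the implicit parse trees, the helper names, and the dummy attachment point ("root", 1) for the top of each tree are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def lex(word):
    return ("lex", word)

def det_n(det, noun):
    return ("np_det_n", [lex(det), lex(noun)])

# The four training trees of Figure 1 as (rule name, list of subtrees).
TRAINING = [
    ("s_np_vp", [("np_pron", [lex("I")]),
                 ("vp_v_np", [lex("want"), det_n("a", "ticket")])]),
    ("s_np_vp", [("np_pron", [lex("I")]),
                 ("vp_v_np", [lex("need"),
                              ("np_np_pp", [det_n("a", "flight"),
                                            ("pp_prep_np", [lex("to"), lex("Boston")])])])]),
    ("s_np_vp", [("np_pron", [lex("We")]),
                 ("vp_v_np", [lex("have"),
                              ("np_np_pp", [det_n("a", "departure"),
                                            ("pp_prep_np", [lex("in"), det_n("the", "morning")])])])]),
    ("s_np_vp", [det_n("The", "flight"),
                 ("vp_vp_pp", [("vp_v", [lex("departs")]),
                               ("pp_prep_np", [lex("at"), ("np_num", [lex("ten")])])])]),
]

def entropy(counter):
    """Entropy (natural logarithm) of the relative frequencies in a Counter."""
    total = sum(counter.values())
    return sum(-(c / total) * math.log(c / total) for c in counter.values())

def phrase_entropies(trees):
    lhs = defaultdict(Counter)  # rule -> counts of the (parent rule, RHS position) it attaches to
    rhs = defaultdict(Counter)  # (rule, RHS position) -> counts of the rules expanding that phrase

    def visit(tree, attachment):
        rule, children = tree
        if rule == "lex":       # lexical lookups are not grammar rules
            return
        lhs[rule][attachment] += 1
        for pos, child in enumerate(children, start=1):
            rhs[(rule, pos)][child[0]] += 1
            visit(child, (rule, pos))

    for tree in trees:
        visit(tree, ("root", 1))  # hypothetical attachment point for the top node
    return ({r: entropy(c) for r, c in lhs.items()},
            {k: entropy(c) for k, c in rhs.items()})

lhs_entropy, rhs_entropy = phrase_entropies(TRAINING)
print(round(lhs_entropy["pp_prep_np"], 2))       # LHS PP of pp_prep_np
print(round(rhs_entropy[("pp_prep_np", 2)], 2))  # RHS NP of pp_prep_np
```

The two printed values correspond to the LHS and 2nd RHS entries of the pp_prep_np row of the table.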

SCHEME OVERVIEW 

In the following scheme, the desired coverage of the spe- 
cialized grammar is prescribed, and the parse trees are 
cut up at appropriate places without having to specify 
the tree-cutting criteria manually: 

1. Index the treebank in an and-or tree where the or- 
nodes correspond to alternative choices of grammar 
rules to expand with and the and-nodes correspond 
to the RHS phrases of each grammar rule. Cutting 
up the parse trees will involve selecting a set of or- 
nodes in the and-or tree. Let us call these nodes 
"cutnodes" . 

2. Calculate the entropy of each or-node. We will cut at 
each node whose entropy exceeds a threshold value. 
The rationale for this is that we wish to cut up the 
parse trees where we can expect a lot of variation 
i.e. where it is difficult to predict which rule will be 
resolved on next. This corresponds exactly to the 
nodes in the and-or tree that exhibit high entropy 
values. 

3. The nodes of the and-or tree must be partitioned 
into equivalence classes dependent on the choice of 
cutnodes in order to avoid redundant derivations at 
parse time.⁴ Thus, selecting some particular node as

4 This can most easily be seen as follows: Imagine two 
identical, but different portions of the and-or tree. If the 
roots and leaves of these portions are all selected as cut- 



a cutnode may cause other nodes to also become cut- 
nodes, even though their entropies are not above the 
threshold. 



4. Determine a threshold entropy that yields the desired 
coverage. This can be done using for example interval 
bisection. 

5. Cut up the training examples by matching them 
against the and-or tree and cutting at the determined 
cutnodes. 

It is interesting to note that a textbook method 
for constructing decision trees for classification from 
attribute-value pairs is to minimize the (weighted average
of the) remaining entropy⁵ over all possible choices
of root attribute, see [Quinlan 1986].



DETAILED SCHEME 

First, the treebank is partitioned into a training set and 
a test set. The training set will be indexed in an and- 
or tree and used to extract the specialized rules. The 
test set will be used to check the coverage of the set of 
extracted rules. 

Indexing the treebank 

Then, the set of implicit parse trees is stored in an and- 
or tree. The parse trees have the general form of a rule 
identifier Id dominating a list of subtrees or a word of 
the training sentence. From the current or-node of the 
and-or tree there will be arcs labelled with rule iden- 
tifiers corresponding to previously stored parse trees. 
From this or-node we follow an arc labelled Id, or add 
a new one if there is none. We then reach (or add) 
an and-node indicating the RHS phrases of the grammar
rule named Id. Here we follow each arc leading
out from this and-node in turn to accommodate all the 
subtrees in the list. Each such arc leads to an or-node. 
We have now reached a point of recursion and can index 
the corresponding subtree. The recursion terminates if 
Id is the special rule identifier lex and thus dominates 
a word of the training sentence, rather than a list of 
subtrees. 

Indexing the four training examples of Figure 1 will
result in the and-or tree of Figure 2.
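
The indexing procedure can be sketched in Python as follows. This is only an illustration of the recursion described above, not the original implementation: the class names OrNode and AndNode, the per-arc counters and the nested-tuple tree encoding are assumptions introduced here.

```python
class OrNode:
    """Choice point: one outgoing arc per rule identifier seen at this position."""
    def __init__(self):
        self.arcs = {}    # rule identifier -> AndNode (None for the terminal "lex" arc)
        self.counts = {}  # rule identifier -> number of stored trees that took this arc

class AndNode:
    """One or-node per RHS phrase of the grammar rule labelling the incoming arc."""
    def __init__(self, arity):
        self.children = [OrNode() for _ in range(arity)]

def index_tree(or_node, tree):
    rule, children = tree
    or_node.counts[rule] = or_node.counts.get(rule, 0) + 1
    if rule == "lex":               # recursion terminates at a lexical lookup
        or_node.arcs.setdefault("lex", None)
        return
    if rule not in or_node.arcs:    # add a new arc and and-node if there is none
        or_node.arcs[rule] = AndNode(len(children))
    for child_or, subtree in zip(or_node.arcs[rule].children, children):
        index_tree(child_or, subtree)

def lex(word):
    return ("lex", word)

# The first two training trees of Figure 1.
trees = [
    ("s_np_vp", [("np_pron", [lex("I")]),
                 ("vp_v_np", [lex("want"),
                              ("np_det_n", [lex("a"), lex("ticket")])])]),
    ("s_np_vp", [("np_pron", [lex("I")]),
                 ("vp_v_np", [lex("need"),
                              ("np_np_pp", [("np_det_n", [lex("a"), lex("flight")]),
                                            ("pp_prep_np", [lex("to"), lex("Boston")])])])]),
]

root = OrNode()
for t in trees:
    index_tree(root, t)

vp = root.arcs["s_np_vp"].children[1]             # or-node for the VP of s_np_vp
np_after_verb = vp.arcs["vp_v_np"].children[1]    # or-node for the NP of vp_v_np
print(np_after_verb.counts)
```

Storing a count on each arc is a design choice of this sketch; it makes the relative frequencies of the alternative rules at an or-node directly available when node entropies are calculated.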

Finding the cutnodes 

Next, we find the set of nodes whose entropies exceed a 
threshold value. First we need to calculate the entropy 
of each or-node. We will here describe three different 



nodes, but the distribution of cutnodes within them differ, 
then we will introduce multiple ways of deriving the portions 
of the parse trees that match any of these two portions of 
the and-or tree. 

5 Defined slightly differently, as described below. 
(Or-nodes n1 to n9 are shown with their entropies in parentheses.
Arcs from or-nodes are labelled with rule names, arcs from
and-nodes with RHS positions.)

root
 s_np_vp
  1: n1 (0.89)
      np_pron
       1: lex
      np_det_n
       1: lex
       2: lex
  2: n2 (0.56)
      vp_v_np
       1: lex
       2: n3 (1.08)
           np_det_n
            1: lex
            2: lex
           np_np_pp
            1: n4 (1.33)
                np_det_n
                 1: lex
                 2: lex
            2: n5 (0.64)
                pp_prep_np
                 1: lex
                 2: n6 (1.76)
                     lex
                     np_det_n
                      1: lex
                      2: lex
      vp_vp_pp
       1: n7 (0.00)
           vp_v
            1: lex
       2: n8 (0.64)
           pp_prep_np
            1: lex
            2: n9 (1.10)
                np_num
                 1: lex

Figure 2: The resulting and-or tree



ways of doing this, but there are many others. Before 
doing this, though, we will discuss the question of re- 
dundancy in the resulting set of specialized rules. 

We must equate the cutnodes that correspond to the 
same type of phrase. This means that if we cut at a 
node corresponding to e.g. an NP, i.e. where the arcs 
incident from it are labelled with grammar rules whose 
left-hand-sides are NPs, we must allow all specialized 
NP rules to be potentially applicable at this point, not 
just the ones that are rooted in this node. This requires 
that we by transitivity equate the nodes that are dominated
by a cutnode in a structurally equivalent way; if
there is a path from a cutnode $c_1$ to a node $n_1$ and a
path from a cutnode $c_2$ to a node $n_2$ with an identical
sequence of labels, the two nodes $n_1$ and $n_2$ must be
equated. Now if $n_1$ is a cutnode, then $n_2$ must also
be a cutnode even if it has a low entropy value. The
following iterative scheme accomplishes this:



Function $N^*(N^0)$

1. $i := 0$;

2. Repeat $i := i + 1$; $N^i := N(N^{i-1})$;

3. Until $N^i = N^{i-1}$

4. Return $N^i$;

Here $N(N^j)$ is the set of cutnodes $N^j$ augmented with
those induced in one step by selecting $N^j$ as the set of
cutnodes. In practice this was accomplished by compiling
an and-or graph from the and-or tree and the set
of selected cutnodes, where each set of equated nodes
constituted a vertex of the graph, and traversing it.
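
The fixpoint iteration can be sketched in Python as follows. This is not the and-or graph compilation mentioned above but a simpler union-find formulation under stated assumptions: nodes are plain identifiers, `arcs` maps a node to its daughters keyed by a (rule, RHS position) label, and `category` gives the phrase type of every node. All of these names, and the toy data at the end, are hypothetical.

```python
def induced_one_step(cut, arcs, category):
    """One step N(N^j): equate cutnodes of the same phrase type, equate nodes reached
    from equated nodes via identically labelled arcs, and promote every node that
    ends up equated with a cutnode."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    # Equate all cutnodes that correspond to the same type of phrase.
    first_of = {}
    for n in cut:
        union(first_of.setdefault(category[n], n), n)

    # Propagate the equation downwards until nothing changes.
    changed = True
    while changed:
        changed = False
        classes = {}
        for n in list(parent):
            classes.setdefault(find(n), []).append(n)
        for members in classes.values():
            reached = {}
            for n in members:
                for label, child in arcs.get(n, {}).items():
                    if union(reached.setdefault(label, child), child):
                        changed = True

    cut_classes = {find(n) for n in cut}
    return {n for n in list(parent) if find(n) in cut_classes}

def closure(cut0, arcs, category):
    """N*(N^0): iterate the one-step induction until the set of cutnodes is stable."""
    cut = set(cut0)
    while True:
        nxt = induced_one_step(cut, arcs, category)
        if nxt == cut:
            return cut
        cut = nxt

# Toy data (made up): two NP nodes, each dominating a PP via the same arc label.
arcs = {"c1": {("np_np_pp", 2): "p1"}, "c2": {("np_np_pp", 2): "p2"}}
category = {"c1": "NP", "c2": "NP", "p1": "PP", "p2": "PP"}
print(sorted(closure({"c1", "c2", "p1"}, arcs, category)))
```

In the toy data, p2 is promoted to a cutnode because it is reached from the cutnode c2 by the same label sequence that leads from the cutnode c1 to the cutnode p1.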

In the simplest scheme for calculating the entropy of 
an or-node, only the RHS phrase of the parent rule, 
i.e. the dominating and-node, contributes to the en- 
tropy, and there is in fact no need to employ an and-or 
tree at all, since the tree-cutting criterion becomes local 
to the parse tree being cut up. 

In a slightly more elaborate scheme, we sum over the 
entropies of the nodes of the parse trees that match this 
node of the and-or tree. However, instead of letting each 
daughter node contribute with the full entropy of the 
LHS phrase of the corresponding grammar rule, these 
entropies are weighted with the relative frequency of 
use of each alternative choice of grammar rule. 

For example, the entropy of node n3 of the and-or
tree of Figure 2 will be calculated as follows: The
mother rule vp_v_np will contribute the entropy associated
with the RHS NP, which is, referring to the table
above, 0.64. There are 2 choices of rules to resolve on,
namely np_det_n and np_np_pp with relative frequencies
1/3 and 2/3 respectively. Again referring to the entropy
table above, we find that the LHS phrases of these rules
have entropy 1.33 and 0.00 respectively. This results in
the following entropy for node n3:

$$S(n_3) = 0.64 + \frac{1}{3} \cdot 1.33 + \frac{2}{3} \cdot 0.00 = 1.08$$

The following function determines the set of cutnodes 
N that either exceed the entropy threshold, or are in- 
duced by structural equivalence: 

Function $N(S_{min})$

1. $N := \{n : S(n) > S_{min}\}$;

2. Return $N^*(N)$;

Here S(n) is the entropy of node n. 
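
The weighted node entropy and the first step of the function above can be illustrated for the running example. In the Python sketch below, the table NODES is transcribed by hand from the and-or tree of Figure 2 and the phrase-entropy table; its layout and the function names are assumptions of this sketch, a lexical lookup is taken to contribute entropy 0.00, and the closure under structural equivalence is deliberately left out.

```python
def node_entropy(rhs_entropy, alternatives):
    """Entropy of the mother rule's RHS phrase plus the frequency-weighted LHS
    entropies of the rules chosen at this or-node.
    alternatives: list of (count, lhs_entropy) pairs, one per outgoing arc."""
    total = sum(count for count, _ in alternatives)
    return rhs_entropy + sum(count / total * s for count, s in alternatives)

# or-node -> (RHS phrase entropy of the mother rule, [(count, LHS entropy), ...])
NODES = {
    "n1": (0.56, [(3, 0.00), (1, 1.33)]),  # np_pron x3, np_det_n x1
    "n2": (0.56, [(3, 0.00), (1, 0.00)]),  # vp_v_np x3, vp_vp_pp x1
    "n3": (0.64, [(1, 1.33), (2, 0.00)]),  # np_det_n x1, np_np_pp x2
    "n4": (0.00, [(2, 1.33)]),             # np_det_n x2
    "n5": (0.00, [(2, 0.64)]),             # pp_prep_np x2
    "n6": (1.10, [(1, 0.00), (1, 1.33)]),  # lex x1, np_det_n x1
    "n7": (0.00, [(1, 0.00)]),             # vp_v x1
    "n8": (0.00, [(1, 0.64)]),             # pp_prep_np x1
    "n9": (1.10, [(1, 0.00)]),             # np_num x1
}

def cutnodes_above(threshold):
    """Step 1 of N(S_min): the nodes whose entropy exceeds the threshold."""
    return {name for name, (rhs, alts) in NODES.items()
            if node_entropy(rhs, alts) > threshold}

print(round(node_entropy(*NODES["n3"]), 2))
print(sorted(cutnodes_above(1.00)))
```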

In a third version of the scheme, the relative frequen- 
cies of the daughters of the or-nodes are used directly 
to calculate the node entropy: 

$$S(n) = \sum_{n_i : (n, n_i) \in A} -p(n_i \mid n) \ln p(n_i \mid n)$$

Here A is the set of arcs, and $(n, n_i)$ is an arc from n to
$n_i$. This is basically the entropy used in [Quinlan 1986].



Unfortunately, this tends to promote daughters of cut- 
nodes to in turn become cutnodes, and also results in a 
problem with instability, especially in conjunction with 
the additional constraints discussed in a later section, 
since the entropy of each node is now dependent on the 
choice of cutnodes. We must redefine the function N(S) 
accordingly: 



Function $N(S_{min})$

1. $i := 0$; $N^0 := \emptyset$;

2. Repeat $i := i + 1$;
   $N := \{n : S(n \mid N^{i-1}) > S_{min}\}$; $N^i := N^*(N)$;

3. Until $N^i = N^{i-1}$

4. Return $N^i$;

Here $S(n \mid N^j)$ is the entropy of node n given that the
set of cutnodes is $N^j$. Convergence can be ensured⁶ by
modifying the termination criterion to be

3. Until $\exists j \in [0, i-1] : \rho(N^i, N^j) < \delta(N^i, N^j)$

for some appropriate set metric $\rho(N_1, N_2)$ (e.g. the size
of the symmetric difference) and norm-like function
$\delta(N_1, N_2)$ (e.g. ten percent of the sum of the sizes),
but this is to little avail, since we are not interested in
solutions far away from the initial assignment of cutnodes.

Finding the threshold 

We will use a simple interval-bisection technique for 
finding the appropriate threshold value. We operate 
with a range where the lower bound gives at least the 
desired coverage, but where the higher bound doesn't. 
We will take the midpoint of the range, find the cut- 
nodes corresponding to this value of the threshold, and 
check if this gives us the desired coverage. If it does, 
this becomes the new lower bound, otherwise it becomes 
the new upper bound. If the lower and upper bounds 
are close to each other, we stop and return the nodes 
corresponding to the lower bound. This termination cri- 
terion can of course be replaced with something more 
elaborate. This can be implemented as follows: 

Function $N(C_0)$

1. $S_{low} := 0$; $S_{high} := largenumber$; $N_c := N(0)$;

2. If $S_{high} - S_{low} < \delta_S$
   then goto 6
   else $S_{mid} := \frac{S_{low} + S_{high}}{2}$;

3. $N := N(S_{mid})$;

4. If $C(N) < C_0$
   then $S_{high} := S_{mid}$
   else $S_{low} := S_{mid}$; $N_c := N$;

5. Goto 2;

6. Return $N_c$;

Here C(N) is the coverage on the test set of the spe- 
cialized grammar determined by the set of cutnodes N. 
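
The interval-bisection loop can be sketched in Python as follows. The function and parameter names are assumptions of this sketch, as is the toy coverage function: it encodes only the observation from the running example that the single test tree is covered whenever the threshold lies below the entropy of n3. The node entropies are those shown in Figure 2, and the boundary cases are ignored.

```python
def find_cutnodes(cutnodes_for, coverage, target, s_high=10.0, tol=1e-3):
    """Bisect on the entropy threshold. The lower bound always gives at least
    the target coverage; the upper bound is assumed not to."""
    s_low = 0.0
    best = cutnodes_for(s_low)
    while s_high - s_low >= tol:
        s_mid = (s_low + s_high) / 2
        nodes = cutnodes_for(s_mid)
        if coverage(nodes) < target:
            s_high = s_mid              # too few cutnodes: lower the threshold
        else:
            s_low, best = s_mid, nodes  # coverage reached: try a higher threshold
    return best

# Node entropies of the running example, as shown in Figure 2.
ENTROPY = {"n1": 0.89, "n2": 0.56, "n3": 1.08, "n4": 1.33, "n5": 0.64,
           "n6": 1.76, "n7": 0.00, "n8": 0.64, "n9": 1.10}

def cutnodes_for(threshold):
    return {n for n, s in ENTROPY.items() if s > threshold}

def toy_coverage(nodes):
    # Hypothetical stand-in for C(N): the single test example is covered
    # exactly when n3 is among the cutnodes.
    return 1.0 if "n3" in nodes else 0.0

print(sorted(find_cutnodes(cutnodes_for, toy_coverage, target=1.0)))
```

Returning the cutnodes of the lower bound mirrors step 6 of the function above: the result is the sparsest set of cutnodes found that still reaches the desired coverage.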

Actually, we also need to handle the boundary case 
where no assignment of cutnodes gives the required cov- 
erage. Likewise, the coverages of the upper and lower 

6 albeit in exponential time 



bound may be far apart even though the entropy dif- 
ference is small, and vice versa. These problems can 
readily be taken care of by modifying the termination 
criterion, but the solutions have been omitted for the 
sake of clarity. 

In the running example, using the weighted sum of 
the phrase entropies as the node entropy, if any thresh- 
old value less than 1.08 is chosen, this will yield any 
desired coverage, since the single test example of Figure 1
is then covered.

Retrieving the specialized rules 

When retrieving the specialized rules, we will match 
each training example against the and-or tree. If the 
current node is a cutnode, we will cut at this point in 
the training example. The resulting rules will be the 
set of cut-up training examples. A threshold value of 
say 1.00 in our example will yield the set of cutnodes 
$\{n_3, n_4, n_6, n_9\}$ and result in the set of specialized rules
of Figure 3.

If we simply let the and-or tree determine the set 
of specialized rules, instead of using it to cut up the 
training examples, we will in general arrive at a larger 
number of rules, since some combinations of choices in 
the and-or tree may not correspond to any training example.
If this latter strategy is used in our example,
this will give us the two extra rules of Figure 4. Note
that they do not correspond to any training example.

ADDITIONAL CONSTRAINTS 

As mentioned at the beginning, the specialized gram- 
mar is compiled into LR parsing tables. Just finding 
any set of cutnodes that yields the desired coverage 
will not necessarily result in a grammar that is well 
suited for LR parsing. In particular, LR parsers, like 
any other parsers employing a bottom-up parsing strat- 
egy, do not blend well with empty productions. This is 
because without top-down filtering, any empty produc- 
tion is applicable at any point in the input string, and a 
naive bottom-up parser will loop indefinitely. The LR 
parsing tables constitute a type of top-down filtering, 
but this may not be sufficient to guarantee termination, 
and in any case, a lot of spurious applications of empty 
productions will most likely take place, degrading per- 
formance. For these reasons we will not allow learned 
rules whose RHSs are empty, but simply refrain from 
cutting in nodes of the parse trees that do not dominate 
at least one lexical lookup. 

Even so, the scheme described this far is not totally
successful: the performance is not as good as when using
hand-coded tree-cutting criteria. This is conjectured
to be an effect of the reduction lengths being far too 
short. The first reason for this is that for any spurious 



1) "S => Det N V Prep NP"

         s_np_vp
         /     \
  np_det_n     vp_vp_pp
    /  \        /    \
  lex  lex   vp_v   pp_prep_np
               |      /  \
              lex   lex   NP

2) "S => Pron V NP"

       s_np_vp
       /     \
 np_pron     vp_v_np
    |         /  \
   lex      lex   NP

3) "NP => Det N"

  np_det_n
    /  \
  lex  lex

4) "NP => NP Prep NP"

  np_np_pp
    /  \
  NP   pp_prep_np
         /  \
       lex   NP

5) "NP => Num"

  np_num
    |
   lex

Figure 3: The specialized rules



6) "S => Det N V NP"

         s_np_vp
         /     \
  np_det_n     vp_v_np
    /  \        /  \
  lex  lex    lex   NP

7) "S => Pron V Prep NP"

       s_np_vp
       /     \
 np_pron     vp_vp_pp
    |         /    \
   lex     vp_v   pp_prep_np
             |      /  \
            lex   lex   NP

Figure 4: Additional specialized rules



rule reduction to take place, the corresponding RHS 
phrases must be on the stack. The likelihood for this to 
happen by chance decreases drastically with increased 
rule length. A second reason for this is that the number 
of states visited will decrease with increasing reduction 
length. This can most easily be seen by noting that the 
number of states visited by a deterministic LR parser 
equals the number of shift actions plus the number of 
reductions, and equals the number of nodes in the cor- 
responding parse tree, and the longer the reductions, 
the more shallow the parse tree. 

The hand-coded operationality criteria result in an 
average rule length of four, and a distribution of reduc- 
tion lengths that is such that only 17 percent are of 
length one and 11 percent are of length two. This is in 
sharp contrast to what the above scheme accomplishes; 
the corresponding figures are about 20 or 30 percent 
each for lengths one and two. 

An attempted solution to this problem is to impose 
restrictions on neighbouring cutnodes. This can be 
done in several ways; one that has been tested is to 
select for each rule the RHS phrase with the least en- 
tropy, and prescribe that if a node corresponding to the 
LHS of the rule is chosen as a cutnode, then no node 
corresponding to this RHS phrase may be chosen as a 
cutnode, and vice versa. In case of such a conflict, the 
node (class) with the lowest entropy is removed from 
the set of cutnodes. 

We modify the function $N^*$ to handle this:

2. Repeat $i := i + 1$; $N^i := N(N^{i-1}) \setminus B(N^{i-1})$;

Here $B(N^j)$ is the set of nodes in $N^j$ that should be removed
to avoid violating the constraints on neighbouring cutnodes.
It is also necessary to modify the termination
criterion as was done for the function $N(S_{min})$
above. Now we can no longer safely assume that the 
coverage increases with decreased entropy, and we must 
also modify the interval-bisection scheme to handle this. 
It has proved reasonable to assume that the coverage 
is monotone on both sides of some maximum, which 
simplifies this task considerably. 

EXPERIMENTAL RESULTS 

A module realizing this scheme has been implemented
and applied to the very setup used for the previous
experiments with the hand-coded tree-cutting criteria, see
[Samuelsson 1994a]. 2100 of the verified parse trees
constituted the training set, while 230 of them were used
for the test set. The table below summarizes the results
for some grammars of different coverage extracted
using:

1. Hand-coded tree-cutting criteria. 



2. Induced tree-cutting criteria where the node entropy 
was taken to be the phrase entropy of the RHS phrase 
of the dominating grammar rule. 

3. Induced tree-cutting criteria where the node entropy 
was the sum of the phrase entropy of the RHS phrase 
of the dominating grammar rule and the weighted 
sum of the phrase entropies of the LHSs of the alter- 
native choices of grammar rules to resolve on. 



In the latter two cases experiments were carried out 
both with and without the restrictions on neighbouring 
cutnodes discussed in the previous section. 



Hand-coded tree-cutting criteria

Coverage    Reduction lengths (%)        Times (ms)
             1      2      3     ≥ 4     Ave.    Med.
90.2 %      17.3   11.3   21.6   49.8    72.6    48.0

RHS phrase entropy. Neighbour restrictions

Coverage    Reduction lengths (%)        Times (ms)
             1      2      3     ≥ 4     Ave.    Med.
75.8 %      11.8   26.1   17.7   44.4    128     38.5
80.5 %      11.5   27.4   20.0   41.1    133     47.2
85.3 %      14.0   37.3   24.3   24.4    241     70.5

RHS phrase entropy. No neighbour restrictions

Coverage    Reduction lengths (%)        Times (ms)
             1      2      3     ≥ 4     Ave.    Med.
75.8 %       8.3   12.4   25.6   53.7    76.7    37.0
79.7 %       9.0   16.2   26.9   47.9    99.1    49.4
85.3 %       8.4   17.3   31.1   43.2    186     74.0
90.9 %      18.2   27.5   21.7   32.6    469     126

Mixed phrase entropies. Neighbour restrictions

Coverage    Reduction lengths (%)        Times (ms)
             1      2      3     ≥ 4     Ave.    Med.
75.3 %       6.1   11.7   30.8   51.4    115.4   37.5

Mixed phrase entropies. No neighbour restrictions

Coverage    Reduction lengths (%)        Times (ms)
             1      2      3     ≥ 4     Ave.    Med.
75 %        16.1   13.8   19.8   50.3    700     92.0
80 %        18.3   16.3   20.1   45.3    842     108



With the mixed entropy scheme it seems important 
to include the restrictions on neighbouring cutnodes, 
while this does not seem to be the case with the RHS 
phrase entropy scheme. A potential explanation for the 
significantly higher average parsing times for all gram- 
mars extracted using the induced tree-cutting criteria 
is that these are in general recursive, while the hand- 
coded criteria do not allow recursion, and thus only 
produce grammars that generate finite languages. 

Although the hand-coded tree-cutting criteria are 
substantially better than the induced ones, we must 



remember that the former produce a grammar that in 
median allows 60 times faster processing than the orig- 
inal grammar and parser do. This means that even if 
the induced criteria produce grammars that are a fac- 
tor two or three slower than this, they are still approx- 
imately one and a half order of magnitude faster than 
the original setup. Also, this is by no means a closed 
research issue, but merely a first attempt to realize the 
scheme, and there is no doubt in my mind that it can 
be improved on most substantially. 

SUMMARY 

This article proposes a method for automatically find- 
ing the appropriate tree-cutting criteria in the EBG 
scheme, rather than having to hand-code them. The 
EBG scheme has previously proved most successful for 
tuning a natural-language grammar to a specific ap- 
plication domain and thereby achieve very much faster 
parsing, at the cost of a small reduction in coverage. 

Instruments have been developed and tested for con- 
trolling the coverage and for avoiding a large number 
of short reductions, which is argued to be the main
source of poor parser performance. Although these
instruments are currently slightly too blunt to enable 
producing grammars with the same high performance 
as the hand-coded tree-cutting criteria, they can most 
probably be sharpened by future research, and in par- 
ticular refined to achieve the delicate balance between 
high coverage and a distribution of reduction lengths 
that is sufficiently biased towards long reductions. Also, 
banning recursion by category specialization, i.e. by for 
example distinguishing NPs that dominate other NPs 
from those that do not, will be investigated, since this is 
believed to be an important ingredient in the version of 
the scheme employing hand-coded tree-cutting criteria. 